Frogs are among the most diverse groups of amphibians in the world, with a vast number of species, each with its own distinctive call and appearance. Although frogs can be found in nearly every part of the United States, closely observing and understanding frog activity in a specific area is difficult, as frogs are easily disturbed by the presence of humans. This experiment therefore used automated recorders to capture audio files at several Michigan sites in order to understand how frogs are being impacted in those areas. While these recorders limit the effect of human presence on frog activity, extracting data from the resulting audio files is itself a difficult process. Similar work has been done before, but it appears to have focused mainly on bird activity rather than frog activity (Marcelo). For this project, we were given numerous hour-long audio files of frog calls recorded at various locations in 2018 and 2021. The calls were gathered from 9 primary locations divided into 4 treatment groups. Our overall objective was to analyze these audio files in R and obtain count data for the frog activity in each file, in order to statistically determine whether regions containing invasive species, like cattails, differed from regions without them. While we did not quite reach the point where such analysis was possible, we made substantial headway toward models that can accurately count the calls of a given frog species in an audio file, provided some labeled templates. We focused in particular on the Northern Leopard Frog (NLF), as its call is highly distinguishable within the audio files. To do this in R, we utilized a family of packages developed by Marcelo Araya-Salas that was designed to automatically pick out bird calls from a given audio file.
This included the packages warbleR, ohun, and Rraven.
To begin this analysis, we first looked at building spectra and spectrograms of the audio files. A spectrum graphically displays the frequency content of an entire signal: amplitude (vertical axis) vs. frequency (horizontal axis). While this can be useful for analyzing one specific frog signal, it is not a good way to see the total number of signals in an audio file. Spectrograms therefore became the next focus of the project. Where a spectrum is a snapshot of a single signal, a spectrogram stitches many spectra together into one large graph of all the signals over a period of time, showing how the frequency content of a signal changes: amplitude (brightness) vs. frequency (vertical axis) vs. time (horizontal axis). To build spectrograms, we used an R package called 'warbleR', as well as other audio packages, such as 'audio', and their spectrogram commands. We expected each frog species to generate its own unique shape within these spectrograms, which would let us count the number of times each shape appeared and, in theory, generate count data for each frog species. While this approach ultimately proved not to be worthwhile, it was valuable for learning how to work with audio files in general.
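As a rough illustration, a spectrogram of a short slice of one recording could be generated as follows. This is a minimal sketch using tuneR and seewave (packages that warbleR builds on); the window length, overlap, and frequency limits are illustrative values, not ones tuned for these files, and the file is assumed to sit in the working directory.

```r
library(tuneR)    # readWave()
library(seewave)  # spectro()

# Read only the first minute: hour-long files are slow to plot whole
wave <- readWave("S4A07474_20180802_120000.wav",
                 from = 0, to = 60, units = "seconds")

# Amplitude (color) vs. frequency (kHz, y-axis) vs. time (s, x-axis)
spectro(wave, f = wave@samp.rate,
        wl = 512,           # FFT window length (samples)
        ovlp = 50,          # % overlap between windows
        flim = c(0, 5),     # frequency limits in kHz (assumed range)
        collevels = seq(-40, 0, 1))
```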
For this project, we focused mainly on the combination of two R packages, warbleR and ohun. Both were developed by Marcelo Araya-Salas, who used R to automatically pick out bird calls from a given audio file. As these packages work in unison with one another, we explored two separate routes when using them: energy-based detection and template-based detection. Both methods are detailed below and outlined in the graphic below.
knitr::include_graphics('Screen Shot 2021-12-18 at 9.46.07 PM.png')
Energy-based detection is a method for finding the times at which the amplitude of the signal passes a defined threshold in a given frequency range. Though this method is much easier to use, it does not fare well on audio samples with a low signal-to-noise ratio, i.e., a lot of background noise relative to the calls. Also, calls from different species with overlapping frequency ranges can be mistaken for each other in energy-based detection. Based on the advice of Marcelo Araya-Salas, whose packages we have been using, we decided to use template-based detection instead.
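For completeness, energy-based detection can be sketched with ohun's energy_detector(); we did not pursue this route, and the bandpass and duration limits below are illustrative assumptions rather than values tuned for frog calls.

```r
library(ohun)

detections <- energy_detector(
  files = "S4A07474_20180802_120000.wav",
  path = ".",            # folder containing the .wav file
  bp = c(0.4, 1.5),      # bandpass in kHz (assumed frequency range)
  threshold = 5,         # % of maximum amplitude envelope
  min.duration = 100,    # ms; discard very short detections
  max.duration = 2000    # ms; discard very long detections
)
head(detections)
```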
Utilizing the ohun package, we were finally able to make some headway in collecting data. Template-based detection works well with highly stereotyped signals, and some of the frog species' calls exhibit this, like the Northern Leopard Frog. As seen in the spectrogram image below, the NLF call consists of a distinct pattern of frequencies that is unique enough to be distinguished from other sounds and picked out quite easily. Unlike energy-based detection, template detection has the great upside of being robust to background noise. This suits our audio files as a whole, since the vast majority of them had low signal-to-noise ratios, meaning a lot of background noise that would throw off energy-based detection.
knitr::include_graphics("Screen Shot 2021-12-18 at 10.18.24 PM.png")
knitr::include_graphics('Screen Shot 2021-12-18 at 10.06.48 PM.png')
First, template-based detection is fed either a single template or a series of template sounds through a .csv file that contains the following features: sound file name, selection number, start time, end time, bottom frequency, and top frequency. An example is shown in image 2. The templates are then converted into a selection table object via the function selection_table(). Correlations between the templates and the desired audio file are then generated with template_correlator(), which uses time-frequency cross-correlation to produce a correlation score for each template against each sound file. These correlations can be inspected alongside spectrograms to compare correlation peaks with actual frog call locations. Alternatively, since that is a manual process, one can use the template_detector() function, which locates all correlations above a designated threshold.
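The steps above can be sketched roughly as follows. "templates.csv" is a hypothetical file with the columns just listed (in warbleR's naming: sound.files, selec, start, end, bottom.freq, top.freq); the target recording name is one of the files used in this project.

```r
library(warbleR)
library(ohun)

# Load hand-labeled NLF call templates and convert to a selection table
templates <- read.csv("templates.csv")
templ_st <- selection_table(templates, path = ".")

# Time-frequency cross-correlation of each template against the recording
corrs <- template_correlator(
  templates = templ_st,
  files = "S4A07071_20210518_220000.wav",
  path = "."
)

# Keep every time point whose correlation exceeds the chosen threshold
detections <- template_detector(template.correlations = corrs,
                                threshold = 0.13)
head(detections)
```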
Though adjusting the correlation threshold for template-based detection works well for finding the calls you are looking for, it also flags many calls that are not actually present in the audio file (false positives). To reconcile this, we compared our outputs from the template-based detection with labeled calls from the same time and used a random forest to distinguish the true positives from the false positives. A random forest is a model that combines many decision trees to make a prediction given the inputs. Each tree selects predictor variables, according to how much information they provide under a chosen criterion, to classify an outcome (in our case, true positive vs. false positive). The random forest trains many trees on bootstrapped (randomly sampled with replacement) data sets and predicts according to the consensus across trees, which helps the model generalize well to unseen data. Our random forest was trained using both frequency-domain and mel-frequency cepstral (MFC) domain summaries of the different calls. The frequency-domain summaries come from the Fourier transform of the original signal and include summary statistics such as the mean, quantiles, kurtosis, and entropy. The MFC-domain summaries come from the inverse transform of the log of the Fourier transform of the original signal and include the minimum, maximum, mean, median, skewness, kurtosis, and variance of the MFCs. Using a random forest lets us reduce the number of false positives that our template-detection model generates while keeping most of the true positives intact.
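A rough sketch of this filtering step, assuming a 'detections' selection table that already carries a hand-labeled true.positive column (our actual feature set and ranger settings may have differed):

```r
library(warbleR)
library(ranger)

# Frequency-domain summary statistics for each detected selection
spec_feats <- spectro_analysis(detections, path = ".")

# Mel-frequency cepstral coefficient summaries for the same selections
mfcc_feats <- mfcc_stats(detections, path = ".")

# Combine both feature sets by selection identifier
feats <- merge(spec_feats, mfcc_feats, by = c("sound.files", "selec"))
feats$true.positive <- as.factor(detections$true.positive)

# Train a random forest to separate true calls from false positives
rf <- ranger(true.positive ~ .,
             data = feats[, !(names(feats) %in% c("sound.files", "selec"))],
             num.trees = 500)
rf$prediction.error  # out-of-bag error estimate
```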
To begin the analysis, we first generated spectrograms for a number of different audio files in order to visualize how often specific frog signals appeared. From there, we wanted to extract the number of times these signals appeared for each frog species to generate count data for analysis. However, this process proved extremely difficult and inefficient. We first analyzed the audio files titled "S4A07474_20180802_120000.wav" and "S4A07071_20210518_220000.wav" so that two spectrograms were generated for two independent audio files. To generate them, we used the 'audio' and 'RCurl' packages, which contain a 'Spectrogram' function. These two spectrograms produced notable results. First, a spectrogram was generated for "S4A07474_20180802_120000.wav" (Figure 1). Although there is plenty of activity in this spectrogram, it was nearly impossible to distinguish individual frog signals, as the unique shapes of the calls were not clearly present. This is even more evident in the second spectrogram, generated for "S4A07071_20210518_220000.wav" (Figure 2), in which a constant signal runs through the entire file. While this could be the spring peepers, a signal overlaying the whole file again makes it nearly impossible to distinguish counts. Both spectrograms were generated with the default axes of hertz for frequency and milliseconds for time, which proved very inefficient for analyzing hour-long audio files, and the package becomes extremely buggy when the axis ranges are adjusted, preventing us from focusing on specific times and frequency ranges within a file.
The package also lacks an auto-detect feature for analyzing the spectrograms, meaning one would have to manually inspect each generated spectrogram for every audio file and identify the unique shapes of each signal, which is very inefficient. Therefore, we chose to focus on different methods in the ohun and warbleR packages.
knitr::include_graphics('Screen Shot 2021-12-18 at 5.25.37 PM.png')
knitr::include_graphics('Screen Shot 2021-12-18 at 5.25.53 PM.png')
Our results using template-based detection showed a high sensitivity but an extremely low specificity. After feeding template_detector() a selection table containing 7 separate NLF call templates from a hand-labeled 2018 audio file, we were able to pick out all 17 of the NLF calls in a separate hand-labeled audio file with a correlation threshold of 0.13. We had initially used a threshold of 0.15, which correctly picked out 16 of the 17 labeled calls while yielding a higher specificity than lower thresholds. However, we later realized that the resulting sensitivity of 0.941 was simply not high enough for the random forest step discussed below, so to retain a sensitivity of 1 we ultimately chose a correlation threshold of 0.13. The need for such high sensitivity is also why the threshold is so low: raising it any further sharply reduced the detector's ability to pick out true positives. Marcelo also advised us that low correlation thresholds were fine, and that picking a threshold that retained a high sensitivity was the most important thing for building an accurate model. With a correlation threshold of 0.13, the template detector picked out all 17 labeled NLF calls in the audio file. This was highly promising, as it indicates that template-based detection can reliably pick out the specified calls even when using templates from a completely unrelated audio file, which suggests the process could eventually be automated without hand-labeled templates for each file. However, alongside the labeled calls, the detector also picked up thousands of false positives. This is best displayed in image 5, where the specificity is extremely low at 0.00078511, with over 21 thousand false positives.
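The sensitivity and specificity figures above can be computed by comparing a detection against the hand-labeled calls with ohun's diagnose_detection(). A sketch, where 'corrs' is assumed to be the output of template_correlator() described earlier and 'reference' a selection table of the 17 labeled NLF calls:

```r
library(ohun)

# Flag every correlation peak above the chosen threshold
detections <- template_detector(template.correlations = corrs,
                                threshold = 0.13)

# Compare detections against the hand-labeled reference:
# returns true/false positive counts, sensitivity, and specificity
diagnose_detection(reference = reference, detection = detections)
```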
When we talked to Marcelo, however, he said this was fine, as the false positives could later be thinned out with a random forest.
knitr::include_graphics("Screen Shot 2021-12-18 at 9.41.52 PM.png")
Utilizing the ranger package, along with the warbleR functions spectro_analysis() and mfcc_stats(), we were able to build a random forest model that greatly cut down the number of false positives: from over 21 thousand to just 6, while barely decreasing our sensitivity. Initially, at the correlation threshold of 0.15, our random forest model was extremely poor at picking out true positives, with a sensitivity of 0.647; despite greatly cutting down the false positives, it was too inaccurate on the true positives. At a threshold of 0.13, however, we were able to build a random forest model with a sensitivity of 0.941 and only a slight decrease in specificity, at 0.727. This proved to be a much more accurate model for picking out NLF calls.
To generate counts for the NLF, we used the random forest model to predict the number of NLF calls in an hour-long, unlabeled audio file. Admittedly, since the random forest was trained on only 5 minutes of hand-labeled audio, training it on more labeled data would likely yield more accurate results. Applied to the unlabeled file, the model classified 894 of the template detections as true calls and 16609 as false positives; adjusting by the sensitivity and specificity of the random forest, this amounts to an estimated 5376 NLF calls in the hour-long audio file.
knitr::include_graphics("Screen Shot 2021-12-18 at 9.52.41 PM.png")
knitr::include_graphics("Screen Shot 2021-12-18 at 9.52.53 PM.png")
In conclusion, obtaining count data from an abundance of audio files turned out to be a very difficult process. Of the approaches we tried, template-based detection had the most success. One major issue with the template-based method detailed in this report is that it relies on the frogs having short call durations: the cross-correlation this package uses does not work well for frogs with long calls, primarily because Marcelo, the package author, developed ohun with bird call detection in mind. To tackle this issue, either a different package would be needed, or an audio file with little background noise could be used so that energy-based detection could handle these longer calls. The spring peeper is an example of such a species, as its calling lasts for the entire audio file; many spring peepers sing at the same time in a chorus, which makes it hard to pick out distinct calls. With more time, this would be an interesting avenue to look into. Given more time, the biggest area of improvement would be using more hand-labeled audio files to train our models, as this is currently the biggest hindrance to generating accurate frog call counts. More hand-labeled data would let us observe how well our models are doing and make better-optimized tweaks. At present, we are moving in the dark in terms of model accuracy, with no way to confirm or validate our results; without proper validation, we are essentially generating numbers that cannot be verified to any degree of certainty.
This hinders our ability to make the tweaks and adjustments that are vital to the model-building process. Finally, with more time we would like to cross-validate our model across multiple different sound files in order to find parameters that generalize well. We would then use the validated model to produce counts for the different species across all of our audio files, giving our clients something that could be used for inferential statistical analysis of the differences between the plots in which the frogs were recorded.
All code for this project is in the repository linked below.
https://github.com/yermemian/Frog-Fellas-Forever
Marcelo’s Resources we utilized:
https://marce10.github.io/warbleR/reference/index.html